This is the live session workspace for the course. Our goal with this repository is to communicate our aims for each week ahead of time, so that you can prepare accordingly.
source('./src/blank_lines.R')
Probability is a system of reasoning about the world in the face of incomplete information. In this course, we’re going to develop an understanding of the implications of core parts of this theory, how this theory was developed, and how these implications relate to every other part of the practice of data science.
probability, the final frontier
At the end of this week’s learning, students will be able to:
At this point in the course, there is so much that is before us! As we settle in to study for the semester, it is useful to have a point of view of where we’re trying to go, and what we are going to see along the way.
Allow a justification by analogy:
Suppose that you decide that you would like to be a chef – all of the time watching cooking shows has revealed to you that this is your life’s true calling – and so you enroll in a culinary program.
One does not begin such a program by baking croissants and soufflés. One begins the program with knife skills, breaking down ingredients, and the basic techniques that build up to produce someone who is not a cook, but a chef – someone who can combine ingredients and techniques to produce novel ideas.
At the same time, however, one has not gone to school just to become a cucumber slicer. The knife skills are instrumental to the eventual goal – of being a chef – but not the goal itself.
At the beginning of the program, we’re teaching these core, fundamental skills. How to read and reason with mathematical objects, how to use conditional probability with the goal of producing a model, and eventually, eventually to create novel work as a data scientist.
At the end of this course, students will be able to:
Demonstrate programming proficiency by translating statistical problems into code.

1. Understand and incorporate best practices for coding style and data carpentry
2. Utilize industry standard tooling for collaboration
The instructors for the course come to the program, and to statistics from different backgrounds. Instructors hold PhDs in statistics, astrophysics, biology, political science, computer science, and information.
Identity shapes how people approach and understand their world.
We would like to acknowledge that we have limited diversity of identity among the instructors for this course. We each have been fortunate to be able to study, but we want to acknowledge that the education system in the US has systematically benefited hegemonic groups and marginalized others’ voices.
Every one of the instructors shares a core identity as an empathetic educator who wants to understand your strengths, areas for growth, and unique point of view that is shaped by who you are. We want to see a field of data scientists who embrace each other’s voices, and respect people for the identities that they hold.
In a breakout room of three or four students, introduce yourself!
Breakout One. A name story is the unique, individual story that describes how you came to have the name that you do. While many people may be called the same thing, each of their name stories is unique.
Please share: What is your name story?
In the same breakout room:
Breakout Two. Like our names, the reasons that we joined this program, our goals and our histories are different.
Please share: What is your data science story? How did you wind up here, in this room today?
Probability
Probability is a system of reasoning that we use to model the world under incomplete information. This model underlies virtually every other model you’ll ever use as a data scientist.
told you this would be spacey
In this course, probability theory builds out to random variables; when combined with sampling theory we are able to develop p-values (which are also random variables) and an inferential paradigm to communicate what we know and how certain a statement we can make about it.
In introduction to machine learning, literally the first model that you will train is a naive bayes classifier, which is an application of Bayes’ Theorem, trained using an iterative fitting algorithm. Later in machine learning, you’ll be fitting non-linear models, but at every point the input data that you are supplying to your models are generated from samples from random variables. That the world can be represented by random variables (which we will cover in the coming weeks) means that you can transform – squeeze and smush, or stretch and pull – variables to heighten different aspects of the variables to produce the most useful information from your data.
As you move into NLP, you might think of generative text as a conditional probability problem: given some particular set of words as an input, what is the most likely next word or words that someone might type?
Beyond the direct instrumental value that we see in working with probability, there are two additional aims that we have in starting the course in this manner.
First, because we are starting with the axioms of probability as they apply to data science statistics, students in this course develop a much fuller understanding of classical statistics than students in most other programs. Unfortunately, it is very common for students and then professionals to see statistics as a series of rules that have to be followed absolutely and without deviation. In this view of statistics, there are distributions to memorize; there are repeated problems to solve that require the rote application of some algebraic rule (i.e. compute the sample average and standard deviation of some vector); and, there are myriad, byzantine statistical tests to memorize and apply. In this view of statistics, if the real-world problem that comes to you as a data scientist doesn’t clearly fit into a box, there’s no way to move forward.
Statistics like this is not fun.
In the way that we are approaching this course, we hope that you’re able to learn why certain distributions (like the normal distribution) arise repeatedly, and why we can use them. We also hope that because you know how sampling theory and random variables combine, that you can be more creative and inventive to solve problems that you haven’t seen before.
The second additional aim that we have for this course is that it can serve as either an introduction or a re-introduction to reading and making arguments using the language of math. For some, this will be a new language; for others, it may have been some years since they have worked with the language; for some, this will feel quite familiar. New algorithms and advancements in data science models are nearly always developed in the math first, and applied in algorithms second. In our view, being a literate reader of graduate- and professional-level math is a necessary skill for any data scientist who is going to keep abreast of the field as it continues to develop, and these first weeks of the course are designed to bring everyone back into reading and reasoning in the language.
The book makes a point of defining our axioms of probability, calling them the
Kolmogorov Axioms
Let \(\Omega\) be a sample space, \(S\) be an event space, and \(P\) be a probability measure. Then, \((\Omega, S, P)\) is a probability space if it satisfies the following: \(P(A) \geq 0\) for every event \(A \in S\); \(P(\Omega) = 1\); and, for any sequence of pairwise disjoint events \(A_1, A_2, A_3, \dots\),

\[ P(A_1 \cup A_2 \cup A_3 \cup \dots) = P(A_1) + P(A_2) + P(A_3) + \dots = \sum_{i}P(A_{i}) \]
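As a small sketch of the additivity axiom (the events here are our own toy example, not from the book): on a fair six-sided die, the disjoint events \(\{1, 2\}\) and \(\{5, 6\}\) have a union whose probability is exactly the sum of their probabilities.

```r
# A toy check of additivity on a fair six-sided die.
# Each atomic outcome has probability 1/6; the events A = {1, 2} and
# B = {5, 6} are disjoint, so P(A union B) should equal P(A) + P(B).
p <- rep(1/6, 6)                  # probability measure on Omega = {1, ..., 6}
A <- c(1, 2)
B <- c(5, 6)
p_union <- sum(p[union(A, B)])
p_sum   <- sum(p[A]) + sum(p[B])
isTRUE(all.equal(p_union, p_sum)) # TRUE: both are 2/3
```

Note that this check depends on the disjointness of the two events; for overlapping events, additivity does not hold directly.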
There is a lot going on in this definition!
First things first, these are the axioms of probability (read aloud in the booming voice of a god).
This means that these are things that we begin from, sort of the foundational principles of the entire system of reasoning that we are going to use. In the style of argument that we’re going to make, these are things that are sort of off-limits to question. Instead, these serve as the grounding assumptions, and we see what happens as we flow forward from these statements.
Second, and importantly, from these axioms there is a very large set of things that we can build. The first things that we will build are probability statements about atomic outcomes (Theorem 1.1.4 in the book), and collections of events. But we are not limited to these statements alone. We can also build frequentist statistics, Bayesian statistics, and language models.
In many ways, these axioms are the fundamental particles that hold our system of probabilistic reasoning together. These are to probability what fermions and bosons are to physics.
What is the difference between a definition and a theorem? On pages 10 and 11 of the textbook, there is a rapid fire collection of pink boxes. We reproduce them here (notice that they may have different index numbers than the book – this live session book autoindexes and we’re not including every theorem and definition in this live session discussion guide).
Conditional Probability For \(A, B \in S\) with \(P(B) > 0\), the conditional probability of \(A\) given \(B\) is \[P(A|B) = \frac{P(A\cap B)}{P(B)}.\]
Multiplicative Law of Probability For \(A, B \in S\) with \(P(B) > 0\), \[P(A|B)P(B) = P(A \cap B)\]
Bayes’ Rule For \(A, B \in S\) with \(P(A) > 0\) and \(P(B) > 0\), \[P(A|B) = \frac{P(B|A)P(A)}{P(B)}.\]
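Bayes’ Rule can be sanity-checked numerically. In this sketch, the probabilities are invented for illustration: a condition with 10% prevalence, a test that is positive 90% of the time when the condition is present and 5% of the time when it is not.

```r
# A numeric check of Bayes' Rule with invented example numbers.
p_A      <- 0.10   # P(A): prevalence of a condition
p_B_A    <- 0.90   # P(B|A): test positive given the condition
p_B_notA <- 0.05   # P(B|not A): false positive rate
# Law of total probability gives P(B):
p_B <- p_B_A * p_A + p_B_notA * (1 - p_A)
# Bayes' Rule gives P(A|B):
p_A_B <- p_B_A * p_A / p_B
round(p_A_B, 3)    # about 0.667: a positive test raises P(A) from 0.10 to 2/3
```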
As a way to begin, let’s define terms that we will use for the next activities.
Group Discussion Question
The book provides a (characteristically) terse statement of what it means for two events to be independent of one another.
Independence of Events Events \(A, B \in S\) are independent if \[P(A \cap B) = P(A)P(B)\].
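As one concrete instance of this definition (the events are our own example): roll two fair dice, and let \(A\) be “the first die shows an even number” and \(B\) be “the second die shows a 6”. Enumerating the 36 equally likely outcomes shows that the definition holds.

```r
# Checking the definition of independence by enumerating a sample space.
# Two fair dice give 36 equally likely outcome pairs.
omega <- expand.grid(d1 = 1:6, d2 = 1:6)
A <- omega$d1 %% 2 == 0           # first die is even
B <- omega$d2 == 6                # second die shows a 6
p_A  <- mean(A)                   # 1/2
p_B  <- mean(B)                   # 1/6
p_AB <- mean(A & B)               # 1/12
isTRUE(all.equal(p_AB, p_A * p_B))  # TRUE: P(A and B) = P(A)P(B)
```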
In your own words:
Try using this idea of independence in two places:
The last task for us to complete today is working through a practice problem on the course practice problem website. Please click the link below, and follow us over to the course’s practice problem website.
Before next live session, please complete the homework that builds on this unit. There are two parts, an applied part and a proof part. You can submit this homework as many times as you like before the due date (you will not receive feedback), and you can access it through bCourses.
The applied homework will be marked either Correct or Incorrect, without partial credit. These are meant to be problems that you solve, and that have a single, straightforward solution concept. The proof homework will be marked for partial credit (out of three points) that evaluates your argument for your solution concept.
yosemite valley
At the end of this week’s course of study (which includes the async, sync, and homework) students should be able to
From the axioms of probability, it is possible to build a whole, expressive modeling system (that need not be grounded at all in the minutiae of the world). With this probability model in place, we can describe how frequently events in the random variable will occur. When variables are dependent upon each other, we can utilize information that is encoded in this dependence in order to make predictions that are closer to the truth than predictions made without this information.
There is both a beauty and a tragedy when reasoning about random variables: we describe random variables using their joint density function.
The beauty is that reasoning with such general objects – the definitions that we create, and the theorems that we derive in this section of the course – produces guarantees that hold in every case, no matter the function that stands in for the joint density function. We will compute several examples of specific functions to provide a chance to reason about these objects and how they “work”.
The tragedy is that in the “real world”, the world where we are eventually going to train and deploy our models, we are never provided with this joint density function. Perhaps this is the creation myth for probability theory: in a perfect world, we can produce a perfect result. But, in the “fallen” world of data, we will only be able to produce approximations.
It is a very good idea for you to create a recurring time to work with a set of your classmates. Working together will help you solve questions more effectively and quickly, and will also help you learn how to communicate what you do and do not understand about a problem to a group of collaborating data scientists. And, working together with a group will help you find people who share data science interests with you.
There are several resources to support your learning. A learning objective last week was that you would be introduced to each of these systems. Please continue to make sure that you have access to the:
What is a random variable? Does this definition help you?
A random variable is a function \(X : \Omega \rightarrow \mathbb{R},\) such that \(\forall r \in \mathbb{R}, \{\omega \in \Omega : X(\omega) \leq r\} \in S\).
Someone, please, read that without using a single “omega”, \(\mathbb{R}\), or other jargon terminology. Instead, someone read this aloud and tell us what each of the concepts mean.
The goal of writing with math symbols like this is to be absolutely clear about which concepts the author does and does not mean to invoke when they write a definition or a theorem. In a very real sense, this is a language that attaches specific meaning to specific symbols; there is a correspondence between the mathematical language and each of our home languages, but exactly what that correspondence is has to be worked out in each student’s home language.
Why do we say that random variables are functions? Is there some useful property of these being functions rather than any other quantity?
What about a function of a random variable, which is a function of a function?
Let \(g : U \rightarrow \mathbb{R}\) be some function, where \(X(\Omega) \subset U \subset \mathbb{R}\). Then, if \(g \circ X : \Omega \rightarrow \mathbb{R}\) is a random variable, we say that \(g\) is a function of X and write \(g(X)\) to denote the random variable \(g \circ X\).
If a random variable is a function from the real world, or the sample space, or the outcome space to a real number, then what does it mean to define a function of a random variable?
Suppose that you are doing some image processing work. To keep things simple, suppose that you are doing image classification in the style of the MNIST dataset.
However, suppose that rather than having good clean indicators for whether a pixel is on or off, instead you have weak indicators – there’s a lot of grey. A lot of the cells are marked in the range \(0.2 - 0.3\).
When you look at the definition of a random variable, there are two key pieces that must exist for every random variable. What are these pieces? As a hint, we actually developed one of these key pieces last week when we were discussing elementary probability.
A random variable is a function \(X : \Omega \rightarrow \mathbb{R},\) such that \(\forall r \in \mathbb{R}, \{\omega \in \Omega : X(\omega) \leq r\} \in S\).
You should try driving in Berkeley some time. It is a trip! Without being deliberately ageist, the city is full of aging hippies driving Subarus and making what seem to be stochastic right-or-left turns to buy incense, pottery, or just sourdough bread.
Suppose that you are walking to campus, and you have to cross 10 crosswalks, each of which are spaced a block apart. Further, suppose that as you get closer to campus, there are fewer aging hippies, and therefore, there is decreasing risk that you’re hit by a Subaru as you cross the street. Specifically, and fortunately for our math, the risk of being hit decreases linearly with each block that you cross.
Finally, campus provides you with the safety reports from last year, and reports that there were 120 student-Subaru incidents last year, out of 10,000 student-crosswalk crossings.
What, if anything is fundamentally different between discrete and continuous random variables? As a way of starting the conversation, consider the following cases:
The book defines the pmf and cdf first, as a way of developing intuition and a way of reasoning about these concepts. It then moves to defining continuous density functions, which in many ways are easier to work with, although they lack the same intuitive means of reasoning. Continuous distributions are defined in the book, and more generally, in terms of the cdf, the cumulative distribution function. There are technical reasons for this choice of definition, some of which are noted in the footnotes on the page where the book presents it.
More importantly for this course, in Definition 1.2.15 the book defines the relationship between cdf and pdf in the following way:
For a continuous random variable \(X\) with CDF \(F\), the probability density function of \(X\) is
\[ f(x) = \left. \frac{d F(u)}{du} \right|_{u=x}, \forall x \in \mathbb{R}. \]
Suppose that you learn that a particular random variable, \(X\), has the following function that describes its pdf: \(f_{X}(x) = \frac{1}{10}x\). Also, suppose that you know that the smallest value that this random variable can possibly take on is 0.
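Because a density must integrate to one, the support of this variable is pinned down by its lower bound. A minimal numeric sketch (our own working, not from the book): if \(f_X(x) = x/10\) on \([0, b]\), then total probability is \(b^2/20\), which forces \(b = \sqrt{20}\).

```r
# The density f_X(x) = x/10 on [0, b] has total mass b^2 / 20,
# so b = sqrt(20). Confirm numerically with integrate().
f <- function(x) x / 10
b <- sqrt(20)
total <- integrate(f, lower = 0, upper = b)$value
round(total, 6)   # 1: the density integrates to one on [0, sqrt(20)]
```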
Working with a single random variable helps to develop our understanding of how to relate the different features of a pdf and a cdf through differentiation and integration. However, there’s not really that much else that we can do; and, there is probably very little in our professional worlds that would look like a single random variable in isolation.
We really start to get to something useful when we consider joint density functions. Joint density functions describe the probability that two random variables jointly take on values. That is, if we are working with random variables \(X\) and \(Y\), then the joint density function provides a probability statement for \(P(X \cap Y)\).
In this course, we might typically write this joint density function as \(f_{X,Y}(x,y) = f(\cdot)\) where \(f(\cdot)\) is the actual function that represents the joint probability.
Suppose that we know that two variables, \(X\) and \(Y\), are jointly uniformly distributed on the support \(x \in [0,4], y \in [0,4]\). We have a requirement, imposed by the Kolmogorov axioms, that all probabilities must be non-negative, and that the total probability across the whole support must be one.
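These two requirements together force the constant density to be \(1/(4 \times 4) = 1/16\). A quick numeric sketch of that reasoning:

```r
# A joint uniform density on [0, 4] x [0, 4] must be the constant 1/16
# so that the total probability over the support is one.
f <- function(x, y) ifelse(x >= 0 & x <= 4 & y >= 0 & y <= 4, 1/16, 0)
# Integrate over y first (for each x), then over x:
inner <- function(x) sapply(x, function(xi) {
  integrate(function(y) f(xi, y), lower = 0, upper = 4)$value
})
total <- integrate(inner, lower = 0, upper = 4)$value
round(total, 6)   # 1
```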
knitr::include_app('http://www.statistics.wtf/PDF_Explorer/')
Suppose that you know that two random variables, \(X\) and \(Y\) are jointly distributed with the following pdf:
\[ f_{X,Y}(x,y) = \begin{cases} a * x^{2} * y^{2} & 0 < x < 1, 0 < y < 1 \\ 0 & otherwise \end{cases} \]
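To find the constant \(a\), use the requirement that the joint density integrates to one over the unit square. A sketch of that calculation, done numerically: \(\int_0^1 \int_0^1 x^2 y^2 \, dy\, dx = \frac{1}{3} \cdot \frac{1}{3} = \frac{1}{9}\), so \(a = 9\).

```r
# Solve for a in f(x, y) = a * x^2 * y^2 on the unit square:
# the mass of x^2 * y^2 alone is (1/3)(1/3) = 1/9, so a = 9.
inner <- function(x) sapply(x, function(xi) {
  integrate(function(y) xi^2 * y^2, lower = 0, upper = 1)$value
})
mass_without_a <- integrate(inner, lower = 0, upper = 1)$value  # about 1/9
a <- 1 / mass_without_a
round(a, 4)   # 9
```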
Here is the true density function for a normal random variable.
Simulate Draws
There’s another way to get an approximate idea of what the distribution looks like. Here’s how we take a single draw from a normal distribution with a specific set of features:
rnorm(n = 1, mean = 2, sd = 1)
## [1] 2.071955
Repeating the Experiment
We want to rerun that experiment 10 times. We take a draw, then
rewind time, clear our memory and start over with fresh randomness. To
do this in R, an easy way is with the replicate() function.
Change the code below so that it repeats the experiment above 10 times,
then use hist() to display a plot of the result.
simulation <- replicate(
n = 10, # should you change this line?
expr = 1 + .2 # or this line?
)
simulation
## [1] 1.2 1.2 1.2 1.2 1.2 1.2 1.2 1.2 1.2 1.2
Better Visualization
Here’s some fancy ggplot code to draw a nice histogram of the result, along with the true density. Remove the first line to make it work with your simulation.
short_simulation <- c(1,2,3,2,4,2)
true_density <- function(x) {
dnorm(x = x, mean = 2, sd = 1)
}
dat_hist <- data.frame(short_simulation)
dat_hist %>%
ggplot() +
geom_histogram(
aes(x = short_simulation, y = ..density..)) +
stat_function(
aes(x = short_simulation), fun = true_density, color = 'darkred')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Repeating the Experiment: Questions
The Visualization Trick
Part I
How can the visualization trick help us? Here’s a problem:
We could do some math to compute the density function of \(Z\), but it’s actually quite messy. Instead, let’s use the visualization trick to get an approximate idea.
Part II
rz <- function() {
2 # Replace with your code
}
Suppose that random variables \(X\) and \(Y\) are jointly continuous, with joint density function given by,
\[ f(x,y) = \begin{cases} c, & 0 \leq x \leq 1, 0 \leq y \leq x \\ 0, & otherwise \end{cases} \]
where \(c\) is a constant.
cake
Review of Terms
Remember some of the key terms we learned in the async:
Explain each of these three in terms of the cake metaphor.
knitr::include_app(url = 'http://statistics.wtf/betahat/', height = '1200px')
a majestic valley
At the end of the live session and homework this week, students will be able to
Where have we come from, and where are we going?
What might seem frustrating about this probability theory system of reasoning is that we are building a castle in the sky – a fiction. We’re supposing that there is some function that describes the probability that values are generated. In reality, there is no such generative function; it is extremely unlikely (though we’ll acknowledge that it is possible) that the physical reality we believe we exist within is just a complex simulation that has been programmed with functions by some unknown designer.
Especially frustrating is that we’re supposing this function, and then we’re further saying,
“If only we had this impossible function; and if only we also had the ability to take an impossible derivative of this impossible function, then we could…”
But, here’s the upshot!
What we are doing today is laying the baseline for models that we will introduce next week. Here, we are going to suggest that there are radical simplifications that we can produce that hold specific guarantees, no matter how complex the function that we’re reasoning about.
In particular, in one specific usage of the term best, we will prove that the expectation is the best one-number summary of any distribution. To do so, we will define a term, variance – the expected squared deviation from the expectation – that describes how “spread out” a variable is. Then, we will define the mean squared error, the expected squared distance between a model prediction and a random variable’s realization. The key realization is that when the model predicts the expectation, the MSE is equal to the variance of the random variable, which is the smallest possible value it could realize.
Although the single number summaries are incredibly powerful, that’s not enough for today’s lesson! We’re also going to suggest that we can create a measure of linear dependence between two variables that we call the “covariance”, and a related, rescaled version of this relationship that is called the correlation.
Together with your instructor, you will talk about what each of the following definitions mean in your own words, for the key concepts, you might also formalize this intuition into a formula that can be computed.
Without using specific numbers, describe the process you would use to calculate the expected value of this distribution.
How does the following formula match with your concept of expectation from that last section?
\[ E(X) = \sum_{x \in \{years\}} x \cdot P(X=x) \]
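As one concrete instance of this formula (our own small example): for a fair six-sided die, each outcome \(x \in \{1, \dots, 6\}\) has \(P(X = x) = 1/6\), so the sum works out to 3.5.

```r
# The expected-value formula applied to a fair six-sided die:
# E(X) = sum over x of x * P(X = x).
values <- 1:6
probs  <- rep(1/6, 6)
ev <- sum(values * probs)
round(ev, 10)   # 3.5
```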
Let \(X\) represent the result of one roll of a 6 sided die.
Fill this in by hand.
Let \(X\) represent the result of one roll of a 6 sided die.
Let \(X\) represent the result of one roll of a 6 sided die.
die <- data.frame(
value = 'fill this in',
prob = 'fill this in'
)
Let \(X\) represent the result of one roll of a 6 sided die.
die
variance_function <- function(die) {
## fill this in
mu = 'fill this in'
var = 'fill this in'
return(var)
}
variance_function(die)
## [1] "fill this in"
Suppose that you had to keep the values the same on the die. That is, the domain of the outcome still had to be the countable set of integers from one to six.
Use the variance_function from above to actually compute this variance.

Now suppose that you again had an equal probability of every outcome, but you were to apply a function to the number of spots that are showing on the die. Rather than each dot contributing one value to the random variable, the random variable’s outcome is the square of the number of spots.
Suppose that \(X\) has the following density function:
\[ f_{X}(x) = \begin{cases} 6x(1 - x), & 0 < x < 1 \\ 0, & otherwise \\ \end{cases} \]
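A numeric sketch of the moments implied by this density (our own working): by symmetry \(E[X] = 1/2\), and \(E[X^2] = 6(\tfrac{1}{4} - \tfrac{1}{5}) = 0.3\), so \(V[X] = 0.3 - 0.25 = 1/20\).

```r
# Mean and variance of the density f(x) = 6x(1 - x) on (0, 1).
f   <- function(x) 6 * x * (1 - x)
ex  <- integrate(function(x) x   * f(x), lower = 0, upper = 1)$value  # E[X]   = 0.5
ex2 <- integrate(function(x) x^2 * f(x), lower = 0, upper = 1)$value  # E[X^2] = 0.3
vx  <- ex2 - ex^2                                                     # V[X]   = 0.05
c(mean = ex, variance = vx)
```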
Suppose that \(X\) and \(Y\) have joint density \(f_{X,Y}(x,y) = 10 x^2y\) for \(0 < y < x < 1\).
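Before working with this density, it is worth confirming (a numeric sketch, our own check) that it is valid: integrating \(y\) from 0 to \(x\) gives \(5x^4\), and integrating that over \([0,1]\) gives one.

```r
# Confirm that f(x, y) = 10 * x^2 * y is a valid density on the
# triangle 0 < y < x < 1: the double integral should equal one.
inner <- function(x) sapply(x, function(xi) {
  integrate(function(y) 10 * xi^2 * y, lower = 0, upper = xi)$value  # = 5 * xi^4
})
total <- integrate(inner, lower = 0, upper = 1)$value
round(total, 6)   # 1
```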
mt. tamalpais
One of our most fundamental goals as data scientists is to produce predictions that are good. In this week’s async, we choose a measure of performance, Mean Squared Error, that we can use to evaluate how good a job a predictor is doing.
With the goal of minimizing MSE, we then present, justify, and prove that the conditional expectation function (the CEF) is the globally best possible predictor. This is an incredibly powerful result, and one that serves as the backstop for every other predictor that you will ever fit, whether that predictor is a “simple” regression, a machine learning algorithm (e.g., a random forest), or a deep learning algorithm. Read that again – even the most technologically advanced algorithms cannot possibly perform better than the conditional expectation function.
Why does the CEF do so well? Because it can contain a vast amount of complex information and relationships; in fact, the complexity of the CEF is a product of the complexity of the underlying probability space. If that is the case, then why don’t we just use the CEF as our predictor every time?
Well, this is one of the core problems of applied data science work: we are never given the function that describes the behavior of the random variable. And so, we’re left in a world where we are forced to produce predictions from simplifications of the CEF. A very strong simplification, but one that is useful for our puny human brains, is to restrict ourselves to predictors that make predictions from a linear combination of input variables.
Why should we make such a strong restriction? After all, the conditional expectation function might be a fantastically complex combination of input features, so why should we entertain functions that are only linear combinations? Essentially, this is because we’re limited in our ability to reason about anything more complex than a linear combination.
thunder struck
At the end of this week’s learning, which includes the asynchronous lectures, reading the textbook, this live session, and the homework associated with the concepts, students should be able to
The first test is released today. There are review sessions scheduled for this week, and practice tests and practice problems are available. The format for the test is posted in the course discussion channel. In addition to the test, your instructor will describe your responsibilities that are due next week.
Think back to remember the definition of the expectation of \(Y\):
\[ E[Y] = \int_{-\infty}^\infty y \cdot f_{Y}(y) dy \]
This week, in the async reading and lectures we added a new concept, the conditional expectation of \(Y\) given \(X=x \in \text{Supp}[X]\):
\[ E[Y|X=x] = \int_{-\infty}^\infty y \cdot f_{Y|X}(y|x) dy \]
Compare and contrast \(E[Y]\) and \(E[Y|X]\). For example, when you look at how these operators are “shaped”, how are their components similar or different?
What is \(E[Y|X]\) a function of? What are “input” variables to this function?
What, if anything, is \(E[E[Y|X]]\) a function of?
\[ f(x,y) = \begin{cases} 2, & 0 \leq x \leq 1, 0 \leq y \leq x \\ 0, & otherwise \end{cases} \]
What does the joint PDF of this function look like?
To begin with, let’s compute the simplest quantities:
And then, let’s think about how to compute the conditional quantities:
Suppose that you were told that \(X\) and \(Y\) are jointly distributed according to the following joint density function:
\[ f_{X,Y}(x,y) = \begin{cases} 2 & 0 \leq x \leq 1, 0 \leq y \leq x \\ 0 & otherwise \end{cases} \]
What is the conditional expectation function? To get started, you can use the fact that in week two, we already computed the conditional probability density function:
\[ f_{Y|X}(y|x) = \begin{cases} \frac{1}{x}, & 0 \leq y \leq x \\ 0, & \text{otherwise.} \end{cases} \]
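From this conditional density, the conditional expectation at any \(x\) is \(E[Y|X=x] = \int_0^x y \cdot \frac{1}{x}\, dy = \frac{x}{2}\). A minimal numeric sketch checking that at a few values of \(x\):

```r
# Check that E[Y | X = x] = x / 2 under f_{Y|X}(y|x) = 1/x on 0 <= y <= x.
cef <- function(x) {
  integrate(function(y) y * (1 / x), lower = 0, upper = x)$value
}
sapply(c(0.2, 0.5, 1), cef)   # 0.1, 0.25, 0.5 -- that is, x / 2
```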
Theorem 2.2.20 states,
The CEF \(E[Y|X]\) is the “best” predictor of \(Y\) given \(X\), where “best” means it has the smallest mean squared error (MSE).
Oh yeah? As a breakout group, ride shotgun with us as we prove that the conditional expectation is the function that produces the smallest possible Mean Squared Error.
Specifically, your group’s task is to justify every transition from one line to the next using concepts that we have learned in the course: definitions, theorems, calculus, and algebraic operations.
We need to find the function \(g(X): \mathbb{R} \to \mathbb{R}\) that gives the smallest mean squared error.
First, let MSE be defined as it is in Definition 2.1.22.
For a random variable \(X\) and constant \(c \in \mathbb{R}\), the mean squared error of \(X\) about \(c\) is \(E[(X-c)^2]\).
Second, let us note that since \(g(X)\) is just a function that maps onto \(\mathbb{R}\), that for some particular value of \(X=x\), \(g(X)\) maps onto a constant value.
\[ \begin{aligned} E[(Y - g(X))^2|X] &= E[Y^2 - 2Yg(X) + g^2(X)|X] \\ &= E[Y^2|X] + E[-2Yg(X)|X] + E[g^2(X)|X] \\ &= E[Y^2|X] - 2g(X)E[Y|X] + g^2(X)E[1|X] \\ &= (E[Y^2|X] - E^2[Y|X]) + (E^2[Y|X] - 2g(X)E[Y|X] + g^2(X)) \\ &= V[Y|X] + (E^2[Y|X] - 2g(X)E[Y|X] + g^2(X)) \\ &= V[Y|X] + (E[Y|X] - g(X))^2 \\ \end{aligned} \]
Notice too that we can use the Law of Iterated Expectations to do something useful. (This is a good point to talk about how this theorem works in your breakout groups.)
\[ \begin{aligned} E[(Y-g(X))^2] &= E\big[E[(Y-g(X))^2|X]\big] \\ &=E\big[V[Y|X]+(E[Y|X]-g(X))^2\big] \\ &=E\big[V[Y|X]\big]+E\big[(E[Y|X]-g(X))^2\big]\\ \end{aligned} \]
The first term, \(E\big[V[Y|X]\big]\), does not depend on \(g\); the second term is non-negative, and equals zero exactly when \(g(X) = E[Y|X]\). \(\therefore g(X) = E[Y|X]\) gives the smallest \(E[(Y-g(X))^2]\)
If you are choosing some \(g\), you can’t do better than \(g(x) = E[Y|X=x]\).
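As a simulated sanity check of this claim (our own construction): for the triangle density \(f(x,y) = 2\) on \(0 \leq y \leq x \leq 1\), the CEF is \(E[Y|X=x] = x/2\). Any competitor predictor should have a larger MSE.

```r
# Simulate (X, Y) from the triangle density f(x, y) = 2 on 0 <= y <= x <= 1:
# X has marginal density 2x (sampled by inverse CDF, X = sqrt(U)),
# and Y | X = x is uniform on [0, x].
set.seed(203)
n <- 100000
x <- sqrt(runif(n))
y <- x * runif(n)
mse_cef   <- mean((y - x / 2)^2)    # the CEF, g(x) = x / 2; MSE near 1/24
mse_other <- mean((y - 0.4 * x)^2)  # an arbitrary competitor predictor
mse_cef < mse_other                 # TRUE: the CEF has the smaller MSE
```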
Why Linear?
In some cases, we might try to estimate the CEF. More commonly, however, we work with linear predictors. Why?
We don’t know the joint density function of \(X\) and \(Y\). So, it is “difficult” to derive a suitable CEF.
Estimating flexible functions requires considerably more data. Assumptions about the distribution (e.g., a linear form) allow you to leverage those assumptions to learn ‘more’ from the same amount of data.
Other times, the CEF, even if we could produce an estimate, might be so complex that it isn’t useful or would be difficult to work with.
And, many times, linear predictors (which might seem trivially simple) actually do a very good job of producing predictions that are ‘close’ or useful.
## Joint Distribution Practice
Let the number of questions that students ask be a RV, \(X\).
Let \(X\) take on values: \(\{1, 2, 3\}\), each with probability \(1/3\).
Every time a student asks a question, the instructor answers incorrectly with probability \(1/4\), independently of other questions.
Let the RV \(Y\) be number of incorrect responses.
Questions:
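This setup can also be simulated directly. By the Law of Iterated Expectations, \(E[Y] = E[E[Y|X]] = E[X] \cdot \frac{1}{4} = 2 \cdot \frac{1}{4} = \frac{1}{2}\); a minimal sketch:

```r
# Simulate the question-and-answer setup: X in {1, 2, 3} with equal
# probability, and Y | X = x is Binomial(x, 1/4) since each of the x
# questions is answered incorrectly with probability 1/4, independently.
set.seed(203)
n <- 100000
x <- sample(1:3, size = n, replace = TRUE)   # number of questions asked
y <- rbinom(n, size = x, prob = 1/4)         # number of incorrect responses
mean(y)   # close to 0.5, matching E[Y] = E[X] / 4
```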
south hall
At the end of this week, students will be able to
t.test, wilcox.test, and the like

For each scenario, is the IID assumption plausible?
Define each of the following:
Questions:
Which Result is Better?
|  | Early Rising | Mindfulness Retreat | MIDS Degree |
|---|---|---|---|
| Increase in Salary | $1020 | $5130 | $9200 |
| \(SE\) | ($350) | ($4560) | |
| \(N =\) | 1,000 | 77 | 700 |
(Standard errors in parentheses when available)
Errors in stating standard errors occur frequently. (We wanted to start with what might be the most vapid, but also confusing, statement possible!)
Standard errors are a statement about the sampling variance of the sample average. But, related to this concept are the ideas of the Population Variance, the Plug-In Estimator for the Sample Variance, the Unbiased Sample Variance, and, finally, the Sampling Variance of the Sample Average (i.e. the Standard Error).
How are each of these concepts related to one another, and how can we keep them all straight? As a group, fill out the following columns:
| Population Concept | Sample Estimator | Estimator Properties | Sampling Variance of Sample Estimator |
|---|---|---|---|
| Expected Value | | | |
| Population Variance | | | |
| Population Covariance | | | |
| CEF | | | |
| BLP | | | |
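One way to keep these variance concepts straight is to compute each of them on a small simulated sample. This sketch uses a made-up Bernoulli(0.5) population, whose population variance is \(p(1-p) = 0.25\); the variable names are our own:

```r
set.seed(203)
n <- 30
x <- rbinom(n, size = 1, prob = 0.5)  # population variance is 0.25

plug_in_variance  <- mean((x - mean(x))^2)  # divides by n; biased downward
unbiased_variance <- var(x)                 # R's var() divides by n - 1
standard_error    <- sqrt(unbiased_variance / n)  # sampling sd of the mean

c(plug_in_variance, unbiased_variance, standard_error)
```

The two sample variances differ only by the factor \(n/(n-1)\), and the standard error further scales by \(1/\sqrt{n}\).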
Set p = 0.5 so your population distribution is symmetric. Use a variable n to represent your sample size. Initially, set n = 3.
n <- 3
p <- 0.5
Use sample() or rbinom() to write a bernoulli function (the Bernoulli distribution is a special case of a binomial distribution; in rbinom, size refers to a distribution parameter, not the number of draws).
sample(x=0:1, size=n, replace=TRUE, prob=c(1-p, p))
rbinom(n=n, size=1, prob=p)
## [1] 0 0 0
## [1] 0 0 0
Use replicate() to repeat an expression many times. For example:
replicate(10, log(10))
## [1] 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585
ggplot() +
  aes(x = rnorm(100)) +
  geom_histogram(bins = 8)
Part 1
Throughout this part, we will use fair coins
(p = 0.5).
Write a function, experiment, that flips coins and uses the mean function to compute the sample mean – the average of the number of heads that are showing. Make sure that when you run it, you return values in \(\{0, 1/3, 2/3, 1\}\).
experiment <- function(n, p){
}
The sample mean is a random variable. To understand it, use the
visualization trick from a few weeks ago. Use the replicate
function to run the above experiment 1000 times, and plot a histogram of
the results.
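A minimal sketch of this workflow, assuming the experiment stub above is filled in using rbinom:

```r
library(ggplot2)

# flip n coins and return the sample mean (the proportion of heads)
experiment <- function(n, p) {
  mean(rbinom(n = n, size = 1, prob = p))
}

# run the experiment 1000 times and plot the sampling distribution
sample_means <- replicate(1000, experiment(n = 3, p = 0.5))

ggplot() +
  aes(x = sample_means) +
  geom_histogram(bins = 4)  # only 4 values are possible when n = 3
```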
If you replicate the experiment enough times, will the distribution ever look normal? Why or why not?
Use sd() to check the standard deviation of the
sampling distribution of the mean for number_of_coins = 3.
What sample size is needed to decrease the standard deviation by a
factor of 10? Check that your answer is correct.
Part 2
For this part, we’ll continue to study a fair coin.
Part 3
For this part, we’ll study a very unfair coin.
p = 0.01.
This is an example of a highly skewed random variable. That roughly means that one tail is a lot longer than the other.
For this activity, you can simply use your eyes to gauge how skewed a distribution is. If you prefer, you can also use the skewness command in the univar package to measure skewness. You may hear a rule of thumb that a skewness above 1 or below -1 indicates a highly skewed distribution.
Start with n=3 as before. What do you notice about the shape of the sampling distribution?
Try different values for the sample size n, and examine the shape of the sampling distribution of the mean. At what point does it look normal to you?
How does the skewness of the population distribution affect the applicability of the Central Limit Theorem? What lesson can you take for your practice of statistics?
Name a variable you would be interested in measuring that has a substantially skewed distribution.
One definition of a heavy-tailed distribution is one with infinite variance. For example, you can use the rcauchy command in R to take draws from a Cauchy distribution, which has heavy tails. Do you think a “heavy-tailed” distribution will follow the CLT? What leads you to this intuition?
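To build this intuition, a sketch comparing a skewed-but-finite-variance coin with the Cauchy; the sample sizes here are our own choices:

```r
set.seed(203)

# skewed Bernoulli(p = 0.01): the CLT still applies, it just needs a larger n
skewed_means <- replicate(1000, mean(rbinom(n = 1000, size = 1, prob = 0.01)))

# Cauchy: infinite variance, so the CLT does not apply -- the mean of n
# Cauchy draws is itself Cauchy distributed, no matter how large n is
cauchy_means <- replicate(1000, mean(rcauchy(n = 1000)))

sd(skewed_means)  # small, and shrinks as n grows
sd(cauchy_means)  # erratic, and does not shrink as n grows
```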
Frequentist hypothesis testing is a very well-established framework in applied practice and the scientific literature. Sometimes (often, currently) referred to as Null Hypothesis Significance Testing (NHST), this framework essentially makes an absurd assumption and asks the data to overturn that absurd assumption.
Like a petulant child, NHST essentially proclaims,
“Oh yeah? Well, if you actually loved me then you would buy me an iPad!”
Which is to say: “Oh yeah? If this absurd thing isn’t true, prove it to me!”
This testing framework works on samples of data, and applies estimators to produce estimates of population parameters that are fundamentally unknown and unknowable. Despite this unknown and unknowable population target, we can rely on the convergence characteristics of carefully written-down estimators to produce useful, reliable results.
We begin with the one-sample t-test. The one-sample t-test relies on the sample average as an estimator of a population expectation. In doing so, it relies on the Weak Law of Large Numbers and the Central Limit Theorem to guarantee that the estimator converges in probability to the population expectation, while also converging in distribution to a Gaussian distribution.
These two convergences permit a data scientist to make several inferences based on data:
This framework begins an exceedingly important task that we must understand and undertake when we are working as data scientists: producing our best estimate, communicating how we arrived at that estimate, what (if any) guarantees that estimate provides, and, crucially, all limitations of our estimate.
Looking Backwards
Today
Looking Ahead
\[ t = \frac{ \overline{X}_n - \mu}{\sqrt{\frac{s^2}{n}}} \]
tone_of_voice <- 'yell'
mode_of_speech <- 'call'
superlative <- 'worst'
score_on_test <- '70'
name_of_classmate <- 'victor'
emotion <- 'ennui'
eating_verb <- 'smack' #slurp
vessel <- 'bowl'
thing_found_in_compost <- 'banana peel'
Suppose that a classmate comes to you and, in a yell voice, calls, “Hey, I’ve got something that is the worst for statistics. All you’ve got to do to get a 70 on Test 2 and make victor feel ennui is smack this bowl of banana peel.”
You’re skeptical, but also curious because that last test was tough. Good.
Acknowledging that we’re only 40% of the way through the course, you decide to hire a hungry, underpaid PhD student from the School to conduct an experiment to evaluate this claim. They report back with no details about the test, but they do tell you, “We’re sure there’s no effect of banana peel.”
Explain this joke:
har har har
In a warehouse full of power packs labeled as 12 volts, we randomly measured the voltage of 7 of them. Here is the data:
voltage <- c(11.77, 11.90, 11.64, 11.84, 12.13, 11.99, 11.77)
Find the mean and the standard deviation.
Using qt(), compute the t critical value for a
hypothesis test for this sample.
Define a test statistic, \(t\), for testing whether the population mean is 12.
Calculate the p-value using the t statistic.
Should you reject the null? Argue this in two different ways. (Following convention, set \(\alpha = .05\).)
Suppose you were to use a normal distribution instead of a t-distribution to test your hypothesis. What would your p-value be for the z-test?
Without actually computing it, say whether a 95% confidence interval for the mean would include 12 volts.
Compute a 95% confidence interval for the mean.
t-Test Micro Cheat Sheet
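As a sketch of the pieces (using the voltage data from above, and assuming a two-sided test of \(\mu = 12\)): the manual steps mirror the \(t\) statistic formula, and t.test() bundles them all in one call.

```r
voltage <- c(11.77, 11.90, 11.64, 11.84, 12.13, 11.99, 11.77)
n <- length(voltage)

t_stat  <- (mean(voltage) - 12) / (sd(voltage) / sqrt(n))  # the t formula
p_value <- 2 * pt(-abs(t_stat), df = n - 1)                # two-sided p-value
conf_int <- mean(voltage) +
  c(-1, 1) * qt(.975, df = n - 1) * sd(voltage) / sqrt(n)  # 95% CI

# t.test() computes all of the above in one call
t.test(voltage, mu = 12)
```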
Testing the Home Team Advantage
The file athlet2.RData contains data on college football games. The data is provided by Wooldridge and was collected by Paul Anderson, an MSU economics major, for a term project. Football records and scores are from the 1993 football season.
load("data/athlet2.RData")
data
## dscore dinstt doutstt htpriv vtpriv dapps htwrd vtwrd dwinrec dpriv
## 1 10 -409 -4679 0 0 -1038 1 1 0 0
## 2 -14 NA -66 0 0 -7051 1 1 0 0
## 3 23 -654 -637 0 0 6209 1 0 1 0
## 4 8 -222 456 0 0 -129 1 1 0 0
## 5 -12 -10 208 0 0 794 1 1 0 0
## 6 7 494 17 0 0 411 0 0 0 0
## 7 -21 2 2 0 0 -4363 1 1 0 0
## 8 -5 96 -333 0 0 1144 1 0 1 0
## 9 -3 223 2526 0 0 3956 0 0 0 0
## 10 -32 -20 0 0 0 -641 0 1 -1 0
## 11 9 66 0 0 0 -278 1 0 1 0
## 12 1 56 -346 0 0 -2223 1 0 1 0
## 13 7 556 717 0 0 -5217 1 0 1 0
## 14 -20 169 -461 0 0 1772 0 1 -1 0
## 15 35 -135 396 0 0 85 1 0 1 0
## 16 35 -40 0 0 0 -988 1 0 1 0
## 17 -25 24 0 0 0 -8140 1 1 0 0
## 18 -9 90 0 0 0 8418 0 1 -1 0
## 19 -33 27 900 0 0 -3273 0 0 0 0
## 20 7 -89 -31 0 0 1906 1 0 1 0
## 21 -3 536 2352 0 0 -151 1 1 0 0
## 22 -6 13261 9111 1 0 -9936 1 1 0 1
## 23 -29 13809 10076 1 0 -6265 0 1 -1 1
## 24 14 -17631 -10589 0 1 1252 1 0 1 -1
## 25 -18 14885 9983 1 0 -4529 1 1 0 1
## 26 48 -15220 -11400 0 1 -318 1 0 1 -1
## 27 -3 99 -29 0 0 -797 0 1 -1 0
## 28 -3 -54 -88 0 0 -372 0 1 -1 0
## 29 -3 -98 -4175 1 0 2460 1 1 0 1
## 30 2 -304 2987 0 1 -3035 1 1 0 -1
We are especially interested in the variable dscore, which represents the score differential: home team score minus visiting team score. We would like to test whether a home team really has an advantage over the visiting team.
The instructor will assign you to one of two teams. Team 1 will argue that the t-test is appropriate to this scenario. Team 2 will argue that the t-test is invalid. Take a few minutes to examine the data, then formulate your best argument.
Should you perform a one-tailed test or a two-tailed test? What is the strongest argument for your answer?
Execute the t-test and interpret every component of the output.
Based on your output, suggest a different hypothesis that would have led to a different test result. Try executing the test to confirm that you are correct.
For the following scenarios, what is the strongest argument against the validity of a t-test?
You have a sample of 50 CEO salaries, and you want to know whether the mean salary is greater than $1 million.
A nonprofit organization measures the percentage of students that pass an 8th grade reading test in 40 neighboring California counties. You are interested in whether the percentage of students that pass in California is over 80%.
You have survey data in which respondents assess their own opinion of corgis, with options ranging from “1 - extreme disgust” to “5 - affection so intense it threatens my career.” You want to know whether people on the average like corgis more than 3, representing neutrality.
This week, we introduce the idea of comparing two groups to evaluate whether the data that we have sampled lead us to believe that the population distributions of the random variables are different. Of course, because we don’t get access to the function that describes the random variable, we can’t actually know if the populations are different. It is for this reason that we call it statistical inference – we are inferring from a sample some belief about the populations.
At the conclusion of this week, students will be able to:
Data science is a beautiful combination of team-work and individual-work. It provides the opportunity to work together on a data pipeline with people from all over the organization, to deal with technical, statistical, and social questions that are always interesting. While we expect that everyone on a team will be a professional, there is so much range within the pursuit of data science that projects are nearly always collaborative exercises.
Together as teams, we
Together as individuals we conduct the heads-down work that moves question answering forward. This might be reading papers to determine the most appropriate method to bring to bear on the question, or researching the data that is available, or understanding the technical requirements that we have to meet for this answer to be useful to our organization in real time.
What is your instructor uniquely capable of? Let them tell you! But, at the same time, what would they acknowledge that others are better at than they are?
See, the thing is, because there is so much that has to be done, there literally are very, very few people who are one-stop data science shops. Instead, teams rely on collaboration and joint expertise in order to get good work done.
People talk about the impostor syndrome: a feeling of inadequacy or interloping that is sometimes also associated with a fear of under-performing relative to the expectations of others on the team. These emotions are common throughout data science, academia, and computer science. But, these types of emotions are also commonplace in journalism, film-making, and public speaking.
Has anybody ever had the dream that they’re late to a test? Or that they’ve got to give a speech that they’re unprepared for? Does anybody remember playing an instrument as a kid and having to go to recitals? Or playing for a championship on a youth sports team? Or going into Test 2?
What are the feelings associated with those events? What might be generating these feelings?
What really matters to creating an effective team is less about who is on the team, and more about how the team works together.
In your live session, your section might take 7 minutes to read this brief. If so, please read the following sections:
“Psychological safety refers to an individual’s perception of the consequences of taking an interpersonal risk. It is a belief that a team is safe for risk taking in the face of being seen as ignorant, incompetent, negative, or disruptive.”
“In a team with high psychological safety, teammates feel safe to take risks around their team members. They feel confident that no one on the team will embarrass or punish anyone else for admitting a mistake, asking a question, or offering a new idea.”
How can you add to the psychological safety of your peers in the section and lab teammates?
Lab 1 Teams
Team Kick-Off Conversation
Review of Key Terms
Comparing Groups Review
Take a moment to recall the tests you learned this week. Here is a quick cheat-sheet to their key assumptions.
| paired/unpaired | parametric | non-parametric |
|---|---|---|
| unpaired | unpaired t-test: metric var, i.i.d., (not too un-)normal | Wilcoxon rank-sum: ordinal var, i.i.d. |
| paired | paired t-test: metric var, i.i.d., (not too un-)normal | Wilcoxon signed-rank: metric var, i.i.d., difference is symmetric; sign test: ordinal var, i.i.d. |
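The R calls corresponding to each cell of the cheat-sheet, sketched with hypothetical vectors x and y (two groups if unpaired, two measurements per unit if paired):

```r
# hypothetical data, for illustration only
x <- c(5, 7, 6, 9, 4, 8)
y <- c(6, 9, 8, 11, 6, 9)

t.test(x, y, paired = FALSE)       # unpaired t-test
wilcox.test(x, y, paired = FALSE)  # Wilcoxon rank-sum

t.test(x, y, paired = TRUE)        # paired t-test
wilcox.test(x, y, paired = TRUE)   # Wilcoxon signed-rank

binom.test(sum(x > y), n = sum(x != y))  # sign test on the paired differences
```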
The General Social Survey (GSS) is one of the longest running and
extensive survey projects in the US. The full dataset includes over 1000
variables spanning demographics, attitudes, and behaviors. The file
GSS_w203.RData contains a small selection of variables
from the 2018 GSS.
To learn about each variable, you can enter it into the search bar at the GSS data explorer
load('data/GSS_w203.RData')
summary(GSS)
## rincome happy sexnow wwwhr
## $25000 or more: 851 very happy : 701 women :758 Min. : 0.00
## $20000 - 24999: 107 pretty happy :1307 man :640 1st Qu.: 3.00
## $10000 - 14999: 94 not too happy: 336 transgender : 2 Median : 8.00
## $15000 - 19999: 61 DK : 0 a gender not listed here: 1 Mean : 13.91
## lt $1000 : 33 IAP : 0 Don't know : 0 3rd Qu.: 20.00
## (Other) : 169 NA : 0 (Other) : 0 Max. :140.00
## NA's :1033 NA's : 4 NA's :947 NA's :986
## emailhr socrel socommun numpets
## Min. : 0.000 sev times a week:382 never :510 Min. : 0.000
## 1st Qu.: 0.000 sev times a mnth:287 once a month :243 1st Qu.: 0.000
## Median : 2.000 once a month :259 sev times a week:219 Median : 1.000
## Mean : 7.152 sev times a year:240 sev times a year:196 Mean : 1.718
## 3rd Qu.: 10.000 almost daily :217 sev times a mnth:174 3rd Qu.: 2.000
## Max. :100.000 (Other) :171 (Other) :215 Max. :20.000
## NA's :929 NA's :792 NA's :791 NA's :1201
## tvhours major1 owngun
## Min. : 0.000 business administration: 138 yes :537
## 1st Qu.: 1.000 education : 79 no :993
## Median : 2.000 engineering : 54 refused: 39
## Mean : 2.938 nursing : 51 DK : 0
## 3rd Qu.: 4.000 health : 42 IAP : 0
## Max. :24.000 (Other) : 546 NA : 0
## NA's :793 NA's :1438 NA's :779
You have a set of questions that you would like to answer with a statistical test. For each question:
library(tidyverse)
library(magrittr)
library(ggthemes)
GSS %>%
filter(major1 %in% c('computer science', 'economics')) %$%
t.test(tvhours ~ major1)
##
## Welch Two Sample t-test
##
## data: tvhours by major1
## t = 1.7123, df = 11.663, p-value = 0.1133
## alternative hypothesis: true difference in means between group computer science and group economics is not equal to 0
## 95 percent confidence interval:
## -0.3134265 2.5800931
## sample estimates:
## mean in group computer science mean in group economics
## 2.600000 1.466667
GSS %>%
filter(major1 %in% c('computer science', 'economics')) %$%
wilcox.test(tvhours ~ major1)
## Warning in wilcox.test.default(x = c(1L, 2L, 2L, 1L, 7L, 5L, 1L, 3L, 2L, : cannot compute exact
## p-value with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: tvhours by major1
## W = 101, p-value = 0.1372
## alternative hypothesis: true location shift is not equal to 0
GSS %>%
filter(major1 %in% c('computer science', 'economics')) %>%
mutate(tv_category = case_when(
tvhours < 2 ~ 1,
tvhours >= 2 & tvhours < 5 ~ 2,
tvhours >= 5 ~ 3)) %$%
# t.test(tv_category ~ major1)
wilcox.test(tv_category ~ major1)
## Warning in wilcox.test.default(x = c(1, 2, 2, 1, 3, 3, 1, 2, 2, 2), y = c(1, : cannot compute
## exact p-value with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: tv_category by major1
## W = 99.5, p-value = 0.1379
## alternative hypothesis: true location shift is not equal to 0
GSS %>%
filter(major1 %in% c('computer science', 'economics')) %$%
wilcox.test(tvhours ~ major1)
## Warning in wilcox.test.default(x = c(1L, 2L, 2L, 1L, 7L, 5L, 1L, 3L, 2L, : cannot compute exact
## p-value with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: tvhours by major1
## W = 101, p-value = 0.1372
## alternative hypothesis: true location shift is not equal to 0
GSS %>%
filter(major1 %in% c('computer science', 'economics')) %>%
ggplot() +
aes(x=tvhours) +
geom_histogram(bins=10)
## Warning: Removed 11 rows containing non-finite values (stat_bin).
GSS %>%
select(wwwhr, emailhr) %>%
drop_na() %$%
t.test(x=wwwhr, y=emailhr, paired=TRUE)
##
## Paired t-test
##
## data: wwwhr and emailhr
## t = 13.44, df = 1360, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 5.530219 7.420553
## sample estimates:
## mean of the differences
## 6.475386
GSS %>%
ggplot() +
geom_histogram(aes(x=wwwhr), fill = 'blue') +
geom_histogram(aes(x=emailhr), fill = 'red')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 986 rows containing non-finite values (stat_bin).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 929 rows containing non-finite values (stat_bin).
t.test(
x = GSS$wwwhr,
y = GSS$emailhr,
paired = FALSE
)
##
## Welch Two Sample t-test
##
## data: GSS$wwwhr and GSS$emailhr
## t = 12.073, df = 2398.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 5.657397 7.851614
## sample estimates:
## mean of x mean of y
## 13.906021 7.151515
What a great question this is… so glad for the chance to answer it, and to deal with this data that someone else has coded for me. Thank you universe.
wilcox_test_data <- GSS %>%
select(socrel, socommun) %>%
mutate(
family_ordered = factor(
x = socrel,
levels = c('almost daily', 'sev times a week',
'sev times a mnth', 'once a month',
'sev times a year', 'once a year', 'never')),
friends_ordered = factor(
x = socommun,
levels = c('almost daily', 'sev times a week',
'sev times a mnth', 'once a month',
'sev times a year', 'once a year', 'never')))
To begin this investigation, we’ve got to look at the data and see what is in it. If you look below, you’ll note that it sure seems that people are spending more time with their family… erp, actually no. They’re “hanging out” with their friends rather than taking their mother out to dinner.
wilcox_test_data %>%
select(friends_ordered, family_ordered) %>%
rename(
Friends = friends_ordered,
Family = family_ordered
) %>%
drop_na() %>%
pivot_longer(cols = c(Friends, Family)) %>%
ggplot() +
aes(x=value, fill=name) +
geom_histogram(stat='count', position='dodge') +
scale_fill_manual(values = c('#003262', '#FDB515')) +
labs(
title = 'Do Americans Spend Time With Friends or Family?',
subtitle = 'A cutting analysis.',
fill = 'Friends or Family',
x = 'Amount of Time Spent') +
scale_x_discrete(guide = guide_axis(n.dodge = 2)) +
theme_fivethirtyeight()
## Warning: Ignoring unknown parameters: binwidth, bins, pad
With this plot created, we can ask whether what we observe in the plot is the product of what could just be sampling error, or something that would be unlikely to arise if the null hypothesis were true. What is the null hypothesis? Well, let’s suppose that if we didn’t know anything about the data, we would expect there to be no difference between the amount of time spent with friends or families.
## risky choice -- casting the factor to a numeric without checking what happens.
wilcox_test_data %$%
wilcox.test(
x=as.numeric(family_ordered),
y=as.numeric(friends_ordered),
paired=FALSE
)
##
## Wilcoxon rank sum test with continuity correction
##
## data: as.numeric(family_ordered) and as.numeric(friends_ordered)
## W = 716676, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
library(tidyverse)
library(broom)
library(testthat)
theme_set(theme_minimal())
Rear-View Mirror
Today
Looking Ahead
Meet us over at this link to work on some math together.
Consider that you have two variables, \(X_{1}\) and \(X_{2}\) that are predicting \(Y\).
Suppose we have random variables \(X\) and \(Y\).
We make the claim in live session that we can re-represent a coefficient that we’re interested in as a function of all the other variables in a regression. That is, suppose that we were interested, initially, in estimating the model:
\[ Y = \hat\beta_{0} + \hat\beta_{1} X_{1} + \hat\beta_{2} X_{2} + \hat\beta_{3}X_{3} + e \] The claim is that we can produce an estimate for \(\hat\beta_{1}\) by fitting this auxiliary regression,
\[ X_{1} = \hat\delta_{0} + \hat\delta_2X_2 + \hat\delta_3X_3 + r_{1} \]
And then using the residuals, noted as \(r_1\) above, in a second auxiliary regression,
\[ Y = \gamma_0 + \gamma_1 r_1 \]
The claim that we make in the live session is that there is a guarantee that \(\hat\beta_1 = \hat\gamma_1\). Here, we are first going to show that this is true, and then we’re going to reason about what this means, and why this feature is interesting (or at least useful) when we are estimating a regression.
Suppose that the population model is the following:
\[ Y = -3 + (1\cdot X_1) + (2\cdot X_2) + (3\cdot X_3) \]
d <- data.frame(
x1 = runif(n=100, min=0, max=10),
x2 = runif(n=100, min=0, max=10),
x3 = runif(n=100, min=0, max=10)
)
## because we know the population model, we can produce a single sample from it
## using the following code:
d <- d %>%
mutate(y = -3 + 1*x1 + 2*x2 + 3*x3 + rnorm(n=n(), mean=0, sd=3))
Notice that when we made this data, we included random noise at the end. The idea here is that there are other “things” in this universe that also affect \(Y\), but that we don’t have access to them. By assumption, what we have measured in this world, \(X_1, X_2, X_3\), are uncorrelated with these other features.
model_main <- lm(y ~ x1 + x2 + x3, data = d)
coef(model_main)
## (Intercept) x1 x2 x3
## -4.459032 1.000384 2.166609 3.112126
The claim is that we can produce an estimate of \(\beta_1\) using an auxiliary set of regression estimates, and then using the residuals from that auxiliary regression.
model_aux <- lm(x1 ~ x2 + x3, data = d)
If we look into the structure of model_aux we can see
that there are a ton of pieces in here.
## str(model_aux)
To evaluate our claim, we need to find the residuals from this regression. As a knowledge check, what is it that we mean when we say, “residual” in this sense?
To make talking about these easier, here is a plot that might be useful.
d %>%
ggplot() +
aes(x=x1, y=y) +
geom_point() +
geom_segment(aes(x=0, xend=10, y=0, yend=50), color = 'steelblue')
In order to access these residuals, we can “augment” the dataframe
that we used in the model, using the broom::augment
function call.
d_augmented <- augment(model_aux)
d_augmented$y <- d$y
d_augmented
## # A tibble: 100 × 10
## x1 x2 x3 .fitted .resid .hat .sigma .cooksd .std.resid y
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.176 9.06 1.14 4.76 -4.58 0.0399 2.81 0.0375 -1.65 20.6
## 2 7.77 0.185 5.28 5.68 2.08 0.0454 2.85 0.00894 0.751 17.4
## 3 7.57 1.92 9.42 5.95 1.63 0.0590 2.85 0.00729 0.590 35.6
## 4 0.105 3.84 1.49 5.11 -5.01 0.0237 2.81 0.0257 -1.78 5.75
## 5 8.18 3.07 5.89 5.56 2.62 0.0202 2.84 0.00596 0.932 26.5
## 6 8.56 5.28 8.78 5.68 2.88 0.0350 2.84 0.0129 1.03 43.2
## 7 4.22 7.28 1.78 4.92 -0.701 0.0230 2.85 0.000489 -0.250 18.7
## 8 6.34 9.54 4.71 5.05 1.30 0.0302 2.85 0.00223 0.463 33.6
## 9 4.18 4.96 1.41 5.04 -0.856 0.0215 2.85 0.000681 -0.305 11.4
## 10 3.25 1.32 0.994 5.23 -1.98 0.0449 2.85 0.00797 -0.713 1.57
## # … with 90 more rows
And finally, with this augmented data that has information from the model, we can estimate the model that includes only the residuals as predictors of \(Y\).
model_two <- lm(y ~ .resid, data = d_augmented)
coef(model_two)
## (Intercept) .resid
## 26.303218 1.000384
Our claim was that the coefficients from model_main and
model_two should be the same.
test_that(
'the model coefficients are equal',
expect_equal(
as.numeric(coef(model_main)['x1']),
as.numeric(coef(model_two)['.resid']))
)
## Test passed 🥳
But, why is this an interesting, or at least useful, feature to appreciate?
Suppose x and y are variables in dataframe
d.
To fit an OLS regression of Y on X:
mod <- lm(y ~ x, data = d)
To access coefficients from the model object:
mod$coefficients
or coef(mod)
To access fitted values from the model object:
mod$fitted
or fitted(mod)
or predict(mod)
To access residuals from the model object:
mod$residuals
or resid(mod)
To create a scatterplot that includes the regression line:
plot(d$x, d$y)
abline(mod)
or
d %>%
ggplot() +
aes(x = x, y = y) +
geom_point() +
geom_smooth(method = lm)
Real Estate in Boston
The file hprice1.RData contains 88 observations of homes
in the Boston area, taken from the real estate pages of the Boston Globe
during 1990. This data was provided by Wooldridge.
load('data/hprice1.RData') # provides 3 objects
head(data)
## price assess bdrms lotsize sqrft colonial lprice lassess llotsize lsqrft
## 1 300.000 349.1 4 6126 2438 1 5.703783 5.855359 8.720297 7.798934
## 2 370.000 351.5 3 9903 2076 1 5.913503 5.862210 9.200593 7.638198
## 3 191.000 217.7 3 5200 1374 0 5.252274 5.383118 8.556414 7.225482
## 4 195.000 231.8 3 4600 1448 1 5.273000 5.445875 8.433811 7.277938
## 5 373.000 319.1 4 6095 2514 1 5.921578 5.765504 8.715224 7.829630
## 6 466.275 414.5 5 8566 2754 1 6.144775 6.027073 9.055556 7.920810
Suppose that you’re interested in knowing the relationship between price and square footage.
Assess the assumptions of the Large-Sample Linear Model.
Create a scatterplot of price and
sqrft. Like every plot you make, ensure that the plot
minimally has a title and meaningful axes.
Find the correlation between the two variables.
Recall the equation for the slope of the OLS regression line –
here you can either use Variance and Covariance, or if you’re bold, the
linear algebra. Compute the slope manually (without using
lm())
Regress price on sqrft using the
lm function. This will produce an estimate for the
following model:
\[ price = \beta_{0} + \beta_{1} sqrft + e \]
Create a scatterplot that includes the fitted regression.
Interpret what the coefficient means.
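The slope-by-hand computation above can be sketched on made-up data (the lab asks you to repeat it with price and sqrft; the numbers here are our own illustrative choices):

```r
set.seed(203)

# made-up data with a known slope of 0.14
x <- runif(88, min = 1000, max = 4000)
y <- 10 + 0.14 * x + rnorm(88, mean = 0, sd = 60)

slope     <- cov(x, y) / var(x)         # beta_1 = Cov(X, Y) / V(X)
intercept <- mean(y) - slope * mean(x)  # beta_0 = E[Y] - beta_1 * E[X]

c(intercept, slope)
coef(lm(y ~ x))  # matches the manual computation
```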
\[ price = \beta_{0} + \beta_{1} sqrft + \beta_{2} lotsize + \beta_{3} colonial? + e \]
In this next set of notes, we’re going to give some data, displayed in plots, and we will try to apply what we have learned in the async and reading for this week to answer questions about each of the scatter plots.
Consider data that is generated according to the following function:
\[ Y = 1 + 2x_1 + 3x_2 + e, \]
where \(x_1 \sim N(0,2)\), \(x_2 \sim N(0,2)\) and \(e\) is a constant equal to zero.
From this population, you might consider taking a sample of 100 observations, and representing this data in the following 3d scatter plot. In this plot, there are three dimensions, an \(x_1, x_2\), and \(y\) dimensions.
knitr::include_app(url ="http://www.statistics.wtf/minibeta_l01/")
Rear-View Mirror
Today
Looking Ahead
| outcome: sleep hours | |
|---|---|
| mg. melatonin | 0.52 |
| (0.31) |
Under the stricter assumption of constant error variance, the variance of a slope coefficient is given by
\[ V(\hat{\beta_j}) = \frac{\sigma^2}{SST_j (1-R_j^2)} \]
A similar formulation is given in FOAS as definition 4.2.3,
\[ \hat{V}_{C}[\hat{\beta}] = \hat{\sigma}^2 \left( X^{T} X \right)^{-1} \rightsquigarrow \frac{\hat{\sigma}^{2}}{\left( X^{T}X\right)} \]
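A sketch verifying this formula against R’s built-in vcov() on made-up data: under homoskedasticity, the variance matrix lm reports is exactly \(\hat\sigma^2 (X^TX)^{-1}\).

```r
set.seed(203)

d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(100)

mod <- lm(y ~ x1 + x2, data = d)

X          <- model.matrix(mod)
sigma2_hat <- sum(resid(mod)^2) / (nrow(X) - ncol(X))  # residual variance

manual_vcov <- sigma2_hat * solve(t(X) %*% X)  # sigma^2 (X'X)^{-1}
all.equal(manual_vcov, vcov(mod), check.attributes = FALSE)
```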
Explain why each term makes the variance higher or lower:
Real Estate in Boston
The file hprice1.RData contains 88 observations of homes
in the Boston area, taken from the real estate pages of the Boston Globe
during 1990. This data was provided by Wooldridge.
load('data/hprice1.RData') # provides 3 objects
Last week, we fit a regression of price on square feet.
model_one <- lm(price ~ sqrft, data = data)
model_one
##
## Call:
## lm(formula = price ~ sqrft, data = data)
##
## Coefficients:
## (Intercept) sqrft
## 11.2041 0.1402
Questions
\[ price = \beta_{0} + \beta_{1} sqrft + \beta_{2} lotsize + \beta_{3} colonial? + e \]
Consider whether colonial and lotsize are allowed to estimate freely, or instead are restricted to be zero (i.e. \(\beta_{2} = \beta_{3} = 0\)).
Use the function vcovHC from the sandwich package to estimate (a) the heteroskedasticity-consistent (i.e. “robust”) variance-covariance matrix; and (b) the robust standard errors for the intercept and slope of this regression. Recall, what is the relationship between the VCOV and SE in a regression?
Perform a hypothesis test to check whether the population
relationship between sqrft and price is zero.
Use coeftest() with the robust standard errors computed
above.
Use the robust standard error and qt to compute a
95% confidence interval for the coefficient sqrft in the
second model that you estimated. \(price =
\beta_{0} + \beta_{1} sqrft + \beta_{2} lotsize + \beta_{3}
colonial\).
Bootstrap. The book very quickly talks about bootstrapping, which is the process of sampling with replacement and fitting a model. The idea behind the bootstrap is that, since the data is generated via an iid sample from the population, you can simulate re-running your analysis by drawing repeated samples from the data that you have.
Below is code that will conduct a bootstrapping estimate of the uncertainty of the sqrft variable when lotsize and colonial are included in the model.
bootstrap_sqft <- function(d = data, number_of_bootstraps = 1000) {
number_of_rows <- nrow(d)
coef_sqft <- rep(NA, number_of_bootstraps)
for(i in 1:number_of_bootstraps) {
bootstrap_data <- d[sample(x=1:number_of_rows, size=number_of_rows, replace=TRUE), ]
estimated_model <- lm(price ~ sqrft + lotsize + colonial, data = bootstrap_data)
coef_sqft[i] <- coef(estimated_model)['sqrft']
}
return(coef_sqft)
}
bootstrap_result <- bootstrap_sqft(number_of_bootstraps = 1000)
With this, it is possible to plot the distribution of these regression coefficients:
ggplot() +
aes(x = bootstrap_result) +
geom_histogram() +
labs(
x = 'Estimated Coefficient',
y = 'Count',
title = 'Bootstrap coefficients for square footage'
)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Compute the standard deviation of the bootstrapped regression coefficients. How does this compare to the robust standard errors you computed above?
Rearview Mirror
Statisticians create a population model to represent the world.
The BLP is a useful way to summarize the relationship between one outcome random variable \(Y\) and input random variables \(X_1,...,X_k\).
OLS regression is an estimator for the Best Linear Predictor (BLP)
We can capture the sampling uncertainty in an OLS regression with standard errors, and tests for model parameters.
Today
The research goal determines the strategy for building a linear model.
Description means summarizing or representing data in a compact, human-understandable way.
We will capture complex relationships by transforming data, including using indicator variables and interaction terms.
Looking Ahead
We will see how model building for explanation is different from building for description.
The famous Classical Linear Model (CLM) allows us to apply regression to smaller samples.
How does the modeling goal influence each of the following steps in the statistical modeling process?
Choice of variables and transformation
Choice of model (OLS regression, neural nets, random forests, etc.)
Model evaluation
#?wage1
#names(wage1)
Which of the following specifications best captures the relationship between education and hourly wage? (Hint: do a quick EDA.)
What is the interpretation of \(\beta_0\) and \(\beta_1\) in your selected specification?
Can we use \(R^2\) or Adjusted \(R^2\) to choose between level-level or log-level specifications?
Remember
The following specifications include two control variables: years of experience (exper) and years at current company (tenure).
Do a quick EDA and select the specification that better suits our description goal.
\(wage = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 tenure + u\)
\(\begin{aligned} wage &= \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 exper^2 + \\ & \beta_4 tenure + \beta_5 tenure^2 + u \end{aligned}\)
How do you interpret the \(\beta\) coefficients?
In the following models, first, explain why the indicator variables or interaction terms have been included. Then identify the reference group (if any) and interpret all coefficients.
\(wage = \beta_0 + \beta_1 educ + \beta_2 I(educ \geq 12) + u\)
\(wage = \beta_0 + \beta_1 educ + \beta_2 female + u\)
\(wage = \beta_0 + \beta_1 educ + \beta_2 female + \beta_3 educ*female + u\)
\(\begin{aligned} wage &= \beta_0 + \beta_1 female + \beta_2 I(educ = 2) + \beta_3 I(educ = 3)\\ &...+ \beta_{20} I(educ = 20) + u\\ \end{aligned}\)
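As a sketch, these specifications translate into R formulas as below. The data here is simulated stand-in data with hypothetical coefficients, since the class exercise uses the wage1 variables:

```r
set.seed(1)
# hypothetical stand-in data using the same variable names as wage1
n      <- 500
educ   <- sample(8:20, size = n, replace = TRUE)
female <- rbinom(n, size = 1, prob = 0.5)
wage   <- 2 + 0.5 * educ - 1.0 * female + 0.1 * educ * female + rnorm(n)
d      <- data.frame(wage, educ, female)

m1 <- lm(wage ~ educ + I(educ >= 12),  data = d)  # indicator for >= 12 years
m2 <- lm(wage ~ educ + female,         data = d)  # level shift for female
m3 <- lm(wage ~ educ * female,         data = d)  # shift plus interaction
m4 <- lm(wage ~ female + factor(educ), data = d)  # one indicator per educ level

coef(m3)  # includes an `educ:female` interaction term
```

Note that `educ * female` in an R formula expands to `educ + female + educ:female`, which matches the third specification above.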
Estimating Returns to Education
Lab 2-Regression
Overview
The Report
- Writing for a collaborating data scientist, what research question have you asked, what answers have you found, and how did you find them?
| Deliverable Name | Week Due | Grade Weight |
|---|---|---|
| Research Proposal | Week 12 | 10% |
| Within-Team Review | Week 12 | 5% |
| Final Presentation | Week 14 | 10% |
| Final Report | Week 14 | 75% |
Team Work Evaluation
Final Presentation
Rearview Mirror
Today
Looking Ahead
\[ \begin{matrix} \\ \text{Sleep} \rightarrow \text{Feelings of Stress} \\ \\ \end{matrix} \]
Omitted Variable Bias in R
The file htv.RData contains data from the 1991 National Longitudinal Survey of Youth, provided by Wooldridge. All people in the sample are males age 26 to 34. The data is interesting here because it includes education, stored in the variable educ, and also a score on an ability test, stored in the variable abil.
Assume that the true model is,
Questions:
1- Are we able to directly measure ability? If so, how would you propose to measure it?
2- If not, what do we measure, and how is this measurement related to ability? (There is a lot of evidence to suggest that standardized tests are not a very good proxy, but for now, let’s pretend that we really are measuring ability.)
3- Using R, estimate (a) the true model, and (b) the regression of ability on education.
Write down the expression for what omitted variable bias would be if you couldn’t measure ability.
Add this omitted variable bias to the coefficient for education to see what it would be.
4- Now evaluate your previous result by fitting the model, \[wage = \alpha_0 + \alpha_1 educ + w\]
Does the coefficient for the relationship between education and wages match what you estimated earlier?
Why or why not?
5- Reflect on your results:
What does the direction of omitted variable bias suggest about OLS estimates of returns to education?
What does this suggest about the reported statistical significance of education?
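A sketch of the mechanics in questions 3 and 4, using simulated stand-in data (the real exercise uses htv.RData; the variable names and coefficient values here are hypothetical):

```r
set.seed(42)
# simulated stand-in for the htv variables
n    <- 1000
abil <- rnorm(n)
educ <- 12 + 2 * abil + rnorm(n)               # ability and education are correlated
wage <- 5 + 0.5 * educ + 1.5 * abil + rnorm(n)
d    <- data.frame(wage, educ, abil)

long  <- lm(wage ~ educ + abil, data = d)  # (a) the "true" model
aux   <- lm(abil ~ educ,        data = d)  # (b) ability regressed on education
short <- lm(wage ~ educ,        data = d)  # the model that omits ability

# omitted variable bias: (coefficient on abil) x (slope of abil on educ)
ovb <- coef(long)['abil'] * coef(aux)['educ']

# the OVB identity: short coefficient = long coefficient + bias (holds exactly in-sample)
coef(short)['educ']
coef(long)['educ'] + ovb
```

The last two lines should agree to machine precision: the omitted variable bias decomposition is an exact algebraic identity for OLS, not just an approximation.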
The Direction of Omitted Variable Bias
| Regression Output | Omitted Variable |
|---|---|
| \(\widehat{grade} = 72.1 + 0.4\ attendance\) | \(time\_studying\) |
| \(\widehat{lifespan} = 87.4 - 1.2\ cigarettes\) | \(exercise\) |
| \(\widehat{lifespan} = 87.4 - 1.2\ cigarettes\) | \(time\_socializing\) |
| \(\widehat{wage} = 14.0 + 2.1\ grad\_education\) | \(experience\) |
| \(\widehat{wage} = 14.0 + 2.1\ grad\_education\) | desire to effect \(social\_good\) |
| \(\widehat{literacy} = 54 + 12\ network\_access\) | \(wealth\) |
Rearview Mirror
Today
Looking Ahead
Part I
Part II
Part III
There are five requirements for the CLM
For each of these requirements:
library(tidyverse)
library(wooldridge)
library(car)
library(lmtest)
library(sandwich)
library(stargazer)
If you haven’t used the mtcars dataset, you haven’t been through an intro applied stats class!
In this analysis, we will use the mtcars dataset, which was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). The dataset is automatically available when you start R. For more information about the dataset, use the R command: help(mtcars)
data(mtcars)
glimpse(mtcars)
## Rows: 32
## Columns: 11
## $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8, 16.4, 17.3, 15.…
## $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8, 8, 8, 8, 4, 4, …
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 167.6, 167.6, 275.…
## $ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180, 205, 215, 230,…
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92, 3.07, 3.07, 3.0…
## $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.440, 3.440, 4.07…
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18.30, 18.90, 17.4…
## $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, …
## $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, …
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 4, 5, …
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2, 2, 4, 2, 1, 2, …
Questions:
Using the mtcars data, run a multiple linear regression of miles per gallon (mpg) on displacement (disp), gross horsepower (hp), weight (wt), and rear axle ratio (drat).
For each of the following CLM assumptions, assess whether the assumption holds. Where possible, demonstrate multiple ways of assessing an assumption. When an assumption appears violated, state what steps you would take in response.
In addition to the above, assess to what extent (imperfect) collinearity is affecting your inference.
Interpret the coefficient on horsepower.
Perform a hypothesis test to assess whether rear axle ratio has an effect on mpg. What assumptions need to be true for this hypothesis test to be informative? Are they?
Choose variable transformations (if any) for each variable, and try to better meet the assumptions of the CLM (while also maintaining the readability of your model).
(As time allows) report the results of both models in a nicely formatted regression table.
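A starting sketch for the first two questions, using only base R; the diagnostic choices below are one reasonable set of checks, not the only valid ones (the lmtest/car versions loaded above follow the same pattern):

```r
data(mtcars)
model <- lm(mpg ~ disp + hp + wt + drat, data = mtcars)
summary(model)$coefficients

# homoskedasticity: look for fanning or patterns in residuals vs. fitted values
plot(fitted(model), resid(model), xlab = 'Fitted values', ylab = 'Residuals')

# normality of errors: Q-Q plot, plus a formal Shapiro-Wilk test
qqnorm(resid(model)); qqline(resid(model))
shapiro.test(resid(model))

# (imperfect) collinearity: pairwise correlations among the predictors
cor(mtcars[, c('disp', 'hp', 'wt', 'drat')])
```

With only 32 observations, these diagnostics are noisy; that small-sample setting is exactly why the CLM assumptions matter here.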
Rearview Mirror
Today
Looking Ahead
Business, policy, education and medical decisions are made by humans based on data
A central task when we observe some pattern in data is to infer whether the pattern will occur in some novel context
Statistics, as we practice it in 203, allows us to characterize:
So that we can either describe, explain or predict behavior.
Statistical models are unknowing transformations of data
As the data scientist, bring them alive to achieve our modeling goals
In Lab 2, we have expanded our ability to parse the world using regression, built a model that accomplishes our goals, and done so in a way that lets us test against a “null” scenario.
We can:
Green Jelly Beans
What went wrong here?
Status Update You have a dataset of the number of Facebook status updates by day of the week. You run 7 different t-tests: one for posts on Monday (versus all other days), one for Tuesday (versus all other days), etc. Only the test for Sunday is significant, with a p-value of .045, so you throw out the other tests.
Should you conclude that Sunday has a significant effect on number of posts? (How can you address this situation responsibly when you publish your results?)
Such Update As before, you have a dataset of the number of Facebook status updates by day of the week. You do a little EDA and notice that Sunday seems to have more “status updates” than all other days, so you recode your “day of the week” variable into a binary one: Sunday = 1, All other days = 0. You run a t-test and get a p-value of .045. Should you conclude that Sunday has a significant effect on number of posts?
Sunday Funday Suppose researcher A tests if Monday has an effect (versus all other days), Researcher B tests Tuesday (versus all other days), and so forth. Only Researcher G, who tests Sunday finds a significant effect with a p-value of .045. Only Researcher G gets to publish her work. If you read the paper, should you conclude that Sunday has a significant effect on number of posts?
Sunday Repentance What if researcher G above is a sociologist who chooses to measure the effect of Sunday based on years of observing the way people behave on weekends? Researcher G is not interested in the other tests, because Sunday is the interesting day from her perspective, and she wouldn’t expect any of the other tests to be significant.
Decreasing Effect Sizes Many observers have noted that as studies yielding statistically significant results are repeated, estimated effect sizes go down and often become insignificant. Why is this the case?
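A quick simulation of the multiple-testing problem running through the scenarios above: with 7 independent tests at \(\alpha = .05\) and no true effect at all, how often does at least one test come up “significant”?

```r
set.seed(203)
# one "week" of analysis: 7 t-tests on pure noise (no true effect anywhere)
one_week <- function() {
  p_values <- replicate(7, t.test(rnorm(30))$p.value)
  any(p_values < 0.05)  # did at least one test reject?
}

# the family-wise error rate, estimated by simulation;
# should be roughly 1 - 0.95^7, about 0.30
rejection_rate <- mean(replicate(2000, one_week()))
rejection_rate
```

Roughly 30% of the time, at least one of the seven null tests rejects, which is why reporting only the one “significant” day is misleading.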
salvation mountain
Rearview Mirror: What We’ve Seen
Today
We have the tools to use data to infer information about the (joint) distribution
Because the joint distribution is complicated, we’ll usually estimate simpler summaries of the joint distribution – e.g. \(E[X]\), \(V[X]\), \(E[Y|X]\), \(Cov[X,Y]\)
There are a number of techniques that you can use to develop an estimator for a parameter. These techniques vary in terms of the principle used to arrive at the estimator and the strength of the assumptions needed to support it.
However, all of these estimators are statistics, meaning they are functions of the data \(\{X_i\}_{i=1}^n\).
R provides two built-in optimizers (optimize, and optim). optimize is the simplest to use, but only works in one dimension.

Suppose that a firm’s profit, as a function of the price \(p\) it charges and a fixed cost \(c\), is

\[ \text{profit} = (p - p^2) - c + 100 \]
Explain how you would use calculus to find the maximizing price. Assume that cost is fixed.
What is the firm’s profit at p = 0, cost = 2? What is it at p = 10, cost = 2?
Create a plot with the following characteristics:

- On the x-axis, a sequence (seq()) of prices from [0, 10].
- On the y-axis, the profit at each price, with cost fixed at c = 2.

Solve this numerically in R, using the optimize() function. Should optimize try to find a maximum or a minimum?

profit <- function(p, c) {
  r <- (p - p^2) - c + 100
  return(r)
}
profit(p=2, c=2)
## [1] 96
best_price <- optimize(
profit, # profit is the function
lower = 0, upper = 1000, # this is the low and high we consider
c = 2, # here, we're passing cost into profit
maximum = TRUE) # we'd like to maximize, not minimize
best_price
## $maximum
## [1] 0.5
##
## $objective
## [1] 98.25
\[ f(x;\lambda) = \frac{\lambda^x e^{-\lambda}}{x!} \]
Here is an example of a string of outcomes generated by a Poisson RV, with parameter \(\lambda = 2\).
rpois(n = 10, lambda = 2)
## [1] 2 1 2 2 2 2 3 1 2 2
data <- c(
2, 6, 2, 1, 3, 3, 4, 4, 24, 1, 5, 4, 5, 1, 2, 2, 5, 2, 1, 5,
2, 1, 2, 9, 9, 1, 3, 2, 1, 1, 3, 1, 3, 2, 2, 4, 1, 1, 5, 3,
3, 2, 2, 1, 1, 1, 5, 1, 3, 1, 1, 1, 1, 2, 2, 4, 2, 1, 2, 2,
3, 1, 2, 6, 2, 2, 3, 2, 3, 5, 1, 3, 2, 5, 2, 1, 3, 2, 1, 2,
4, 2, 6, 1, 2, 2, 3, 5, 2, 1, 4, 2, 2, 1, 3, 2, 2, 4, 1, 1,
1, 1, 2, 3, 5, 1, 2, 2, 3, 1, 4, 1, 3, 2, 2, 2, 2, 2, 2, 3,
3, 1, 1, 2, 2, 4, 1, 5, 2, 7, 5, 2, 3, 2, 5, 3, 1, 2, 1, 1,
2, 3, 1, 5, 3, 4, 6, 3, 3, 2, 2, 1, 2, 2, 4, 2, 3, 4, 3, 1,
6, 3, 1, 2, 3, 2, 2, 3, 1, 1, 1, 1, 1, 10, 3, 2, 1, 1, 3, 2,
2, 3, 1, 1, 2, 2, 2, 4, 2, 2, 3, 3, 6, 1, 3, 2, 3, 2, 2, 2
)
table(data)
## data
## 1 2 3 4 5 6 7 9 10 24
## 54 69 38 14 14 6 1 2 1 1
Use the data that is stored in data, together with a Poisson model, to estimate the \(\lambda\) value that produces the “good” model from the Poisson family.
That is, use MLE to estimate \(\lambda\).
Here is your work flow:
poisson_ll <- function(data, lambda) {
  ## fill this in:
  lambda # this is a placeholder, change this
}
search_space <- seq(0,100, by = 0.1)
plot(
x = search_space, xlab = 'Search Space',
y = poisson_ll(data=data, lambda=search_space), ylab = 'Log Likelihood',
type = 'l'
)
# optimize(poisson_ll, lower = 0, upper = 100, data = data, maximum = TRUE)
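One way to fill in the log-likelihood (a sketch; try deriving it yourself before reading): vectorize over \(\lambda\) so the same function works for both the plot and `optimize`. Shown here with a short stand-in data vector rather than the full `data` object above:

```r
poisson_ll <- function(data, lambda) {
  # sum of Poisson log-pmfs; sapply makes this work over a vector of lambdas
  sapply(lambda, function(l) sum(dpois(data, lambda = l, log = TRUE)))
}

# a short stand-in vector (the exercise uses the full `data` object above)
data <- c(2, 6, 2, 1, 3, 3, 4, 4, 24, 1)

opt <- optimize(poisson_ll, lower = 0, upper = 100, data = data, maximum = TRUE)
opt$maximum  # numeric answer
mean(data)   # calculus answer: the Poisson MLE for lambda is the sample mean
```

The numeric optimum and the sample mean should agree to within `optimize`’s tolerance, since setting the derivative of the Poisson log-likelihood to zero gives \(\hat\lambda = \bar{X}\).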
This exercise is meant to demonstrate what the confidence level in a confidence interval represents. We will assume a standard normal population distribution and simulate what happens when we draw a sample and compute a confidence interval.
Your task is to complete the following function so that it:

1. simulates n draws from a standard normal distribution, and
2. computes a confidence interval with confidence level alpha.

Your function should return a vector of length 2, containing the lower bound and upper bound of the confidence interval.
\[ CI_{\alpha} = \overline{X} \pm t_{\alpha/2} \cdot \frac{s}{\sqrt{n}} \]
where:

- \(\overline{X}\) is the sample mean,
- \(s\) is the sample standard deviation, and
- \(t_{\alpha/2}\) is the critical value from the t distribution (in R, qt).

sim_conf_int <- function(n, alpha) {
  # Fill in your code to:
  # 1. simulate n draws from a standard normal dist.
  # 2. compute a confidence interval with confidence level alpha
  sample_draws <- 'fill this in'
  sample_mean  <- 'fill this in'
  sample_sd    <- 'fill this in'
  critical_t   <- 'fill this in'
  ci_95        <- 'fill this in'
  return(ci_95)
}
sim_conf_int(n = 100, alpha = 0.25)
## [1] "fill this in"
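One possible completion (a sketch; try it yourself first). The only subtle step is the critical value: for a two-sided interval at confidence level \(1 - \alpha\), use `qt` at the \(1 - \alpha/2\) quantile with \(n - 1\) degrees of freedom:

```r
sim_conf_int <- function(n, alpha) {
  sample_draws <- rnorm(n)                       # 1. n standard normal draws
  sample_mean  <- mean(sample_draws)
  sample_sd    <- sd(sample_draws)
  critical_t   <- qt(1 - alpha / 2, df = n - 1)  # 2. two-sided critical value
  sample_mean + c(-1, 1) * critical_t * sample_sd / sqrt(n)
}

sim_conf_int(n = 100, alpha = 0.05)
```

Each call returns a different interval; over many calls, about \(100(1-\alpha)\%\) of them should capture the true mean of 0.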
When your function is complete, you can use the following code to run your function 100 times and plot the results.
many_confidence_intervals <- function(num_simulations, n, alpha) {
  ## args:
  ##  - num_simulations: the number of simulated confidence intervals
  ##  - n: the number of observations in each simulation that will pass
  ##    into your `sim_conf_int` function
  ##  - alpha: the confidence level that you will pass into
  ##    your `sim_conf_int` function
  results <- NULL
  for(i in 1:num_simulations) {
    interval <- sim_conf_int(n, alpha)
    results  <- rbind(results, c(interval[1], interval[2], interval[1] < 0 & interval[2] > 0))
  }
  resultsdf <- data.frame(results)
  names(resultsdf) <- c("low", "high", "captured")
  return(resultsdf)
}
n = 20
confidence_intervals = many_confidence_intervals(100, n, .05)
plot_many_confidence_intervals <- function(c) {
  plot(NULL, type = "n",
       xlim = c(1, 100), xlab = 'Trial',
       ylim = c(min(c$low), max(c$high)), ylab = expression(mu), pch = 19)
  abline(h = 0, col = 'gray')
  abline(h = qt(0.975, n-1)/sqrt(n), lty = 2, col = 'gray')
  abline(h = qt(0.025, n-1)/sqrt(n), lty = 2, col = 'gray')
  points(c$high, col = 2 + c$captured, pch = 20)
  points(c$low,  col = 2 + c$captured, pch = 20)
  for(i in 1:nrow(c)) {
    lines(c(i, i), c(c$low[i], c$high[i]), col = 2 + c$captured[i], pch = 19)
  }
  title(expression(paste("Simulation of t-Confidence Intervals for ", mu,
                         " with Sample Size 20")))
  legend(0, -.65, legend = c(expression(paste(mu, " Captured")),
                             expression(paste(mu, " Not Captured"))), fill = c(3, 2))
}
# plot_many_confidence_intervals(confidence_intervals)
Part I
Suppose that you’ve got a particular sequence of values, \(\{1, 0, 0, 1, 0, 1, 1, 1, 1, 1\}\), that indicates whether a printer worked each particular time you tried to print.
You have data from the last 10 times you tried.
Question:
bbc, office space
Part II
The data resembles draws from a Bernoulli distribution.
However, even if we want to model this as a Bernoulli distribution, we do not know the value of the parameter, \(p\).
1- Define your random variable.
2- Write down the likelihood function
3- If it will make the math easier, log the likelihood function.
4- Path 1: Maximize the likelihood using calculus
5- Path 2: Maximize using numeric methods.
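A sketch of Path 2 for the printer data above. The calculus answer from Path 1 is \(\hat{p} = \overline{X}\), so the numeric optimum should land near 0.7:

```r
x <- c(1, 0, 0, 1, 0, 1, 1, 1, 1, 1)  # the printer outcomes from Part I

# Bernoulli log-likelihood as a function of p
bernoulli_ll <- function(p, x) sum(x * log(p) + (1 - x) * log(1 - p))

# maximize numerically; bounds avoid log(0) at the endpoints
opt <- optimize(bernoulli_ll, lower = 0.001, upper = 0.999, x = x, maximum = TRUE)
opt$maximum  # numeric MLE for p
mean(x)      # calculus MLE: the sample proportion, 0.7
```

Agreement between the two paths is a useful sanity check on both the derivation and the code.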
An effective rubric for student understanding is attributed to Bloom (1956). Referred to as Bloom’s Taxonomy, it proposes that there is a hierarchy of student understanding: a student may have one level of reasoning skill with a concept, but not another. The taxonomy is ordered: some levels of reasoning build upon other levels of reasoning.
In the learning objectives that we present for each live session, we will also identify the level of reasoning that we hope students will achieve by the conclusion of the live session.
UC Berkeley, School of Information↩︎
Note, when we say “shaped” here, we’re referring to the deeper concept of a statistical functional. A statistical functional is a function of a function that maps to a real number. So, if \(T\) is the functional that we’re thinking of, \(\mathcal{F}\) is a family of functions that it might operate on, and \(\mathbb{R}\) is the set of real numbers, a statistical functional is just \(T: \mathcal{F} \rightarrow \mathbb{R}\). The Expectation statistical functional, \(E[X]\), always has the form \(\int x f_{X}(x)dx\).↩︎
Take a moment to strategize just a little bit before you get going on this one. There is a way to compute this value that is easier than another way to compute this value.↩︎